Abstract

Loops in proteins are flexible regions connecting regular secondary structures. They are often involved in protein functions 
through interacting with other molecules. The irregularity and flexibility of loops make their structures difficult to determine 
experimentally and challenging to model computationally. Conformation sampling and energy evaluation are the two key 
components in loop modeling. We have developed a new method for loop conformation sampling and prediction based on 
a chain growth sequential Monte Carlo sampling strategy, called Distance-guided Sequential chain-Growth Monte Carlo
(DiSGro). With an energy function designed specifically for loops, our method can efficiently generate high quality loop
conformations with low energy that are enriched with near-native loop structures. The average minimum global backbone
RMSD for 1,000 conformations of 12-residue loops is 1.53 Å, with a lowest energy RMSD of 2.99 Å, and an average ensemble
RMSD of 5.23 Å. A novel geometric criterion is applied to speed up calculations. The computational cost of generating 1,000
conformations for each of the x loops in a benchmark dataset is only about 10 cpu minutes for 12-residue loops, compared
to ca. 180 cpu minutes using the FALCm method. Test results on benchmark datasets show that DiSGro performs
comparably or better than previous successful methods, while requiring far less computing time. DiSGro is especially 
effective in modeling longer loops (10-17 residues). 
Introduction

Protein loops connect regular secondary structures and are
flexible regions on the protein surface. They often play important
functional roles in recognition and binding of small molecules or 
other proteins [1-3]. The flexibility and irregularity of loops make 
their structures difficult to resolve experimentally [4]. They are 
also challenging to model computationally [5,6]. Prediction of loop 
conformations is an important problem and has received 
considerable attention [5-27]. 

Among existing methods for loop prediction, template-free 
methods build loop structures de novo through conformational 
search [5-7,9,10,13,14,17,18,21,23,28]. Template-based meth- 
ods build loops by using loop fragments extracted from 
known protein structures in the Protein Data Bank [11,19,27]. 
Recent advances in template-free loop modeling have enabled 
prediction of structures of long loops with impressive accuracy 
when crystal contacts or protein family specific information 
such as that of GPCR family is taken into account [14,23, 
25]. 



Loop modeling can be considered as a miniaturized protein 
folding problem. However, several factors make it much more 
challenging than folding small peptides. First, a loop conforma- 
tion needs to connect two fixed ends with desired bond lengths 
and angles [8,12]. Generating quality loop conformations 
satisfying this geometric constraint is nontrivial. Second, the 
complex interactions between atoms in a loop and those in its 
surrounding make the energy landscape around near-native loop 
conformations quite rugged. Water molecules, which are often 
implicitly modeled in most loop sampling methods, may 
contribute significantly to the energetics of loops. Hydrogen 
bonding networks around loops are usually more complex and 
difficult to model than those in regular secondary structures. 
Third, since loops are located on the surface of proteins,
conformational entropy may also play a more prominent role in
the stability of near-native loop conformations [29,30].
Approaches based on energy optimization, which ignore backbone
and/or side chain conformational entropies, may be biased
toward overly compact non-native structures. Despite
extensive studies in the past and significant progress made in
recent years, both conformational sampling and energy
evaluation remain challenging problems, especially for long
loops (e.g., n > 12).
Author Summary

Loops in proteins are flexible regions connecting regular 
secondary structures. They are often involved in protein 
functions through interacting with other molecules. The 
irregularity and flexibility of loops make their structures 
difficult to determine experimentally and challenging to 
model computationally. Despite significant progress made 
in the past in loop modeling, current methods still cannot 
generate near-native loop conformations rapidly. In this 
study, we develop a fast chain-growth method for loop 
modeling, called Distance-guided Sequential chain-Growth 
Monte Carlo (DiSGro), to efficiently generate high quality 
near-native loop conformations. The generated loops can 
be used directly for downstream applications or as 
candidates for further refinement. 

In this paper, we propose a novel method for loop sampling, 
called Distance-guided Sequential chain-Growth Monte 
Carlo (DiSGro). Based on the principle of chain growth 
[15,31,32,34,35], the strategy of sampling through sequentially 
growing protein chains allows efficient exploration of conforma- 
tional space [15,34-37]. For example, the Fragment Regrowth via 
Energy-guided Sequential Sampling (FRESS) method outper- 
formed previous methods on folding benchmark HP sequences 
[15,33]. In addition to the HP model [15], sequential chain-growth
sampling has been used to study protein packing and void 
formation [35], side chain entropy [29,38], near-native protein 
structure sampling [30], conformation sampling from contact 
maps [39], reconstruction of transition state ensemble of protein 
folding [40], RNA loop entropy calculation [37], and structure 
prediction of pseudo-knotted RNA molecules [41]. 

In this study, we first derive empirical distributions of end-to- 
end distances of loops of different lengths, as well as empirical 
distributions of backbone dihedral angles of different residue types 
from a loop database constructed from known protein structures. 
An empirical distance guidance function is then employed to bias 
the growth of loop fragments towards the C-terminal end of the 
loop. The backbone dihedral angle distributions are used to 
sample energetically favorable dihedral angles, which lead to 
improved exploration of low energy loop conformations.
Computational cost is reduced by excluding atoms from energy
calculation using a REsidue-residue Distance Cutoff and ELLipsoid
criterion, called Redcell. Sampled loop conformations, all free of
steric clashes, can be scored and ranked efficiently using an
atom-based distance-dependent empirical potential function
specifically designed for loops.
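
The distance-guided growth step described above can be illustrated with a minimal sketch. It assumes the guidance function is available as a callable that maps the distance between a candidate position of the growing end and the C-terminal anchor to a non-negative weight. The names `selection_probabilities`, `pick_trial_state`, and `toy_guidance` are hypothetical, and the toy weight is a stand-in, not the empirical distribution derived in this study.

```python
import math
import random


def selection_probabilities(candidates, target, guidance):
    """Normalize guidance weights of candidate positions into probabilities."""
    weights = [guidance(math.dist(c, target)) for c in candidates]
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("no candidate is compatible with closing the loop")
    return [w / total for w in weights]


def pick_trial_state(candidates, target, guidance, rng=random):
    """Draw one candidate; return its index and the probability it was drawn with."""
    probs = selection_probabilities(candidates, target, guidance)
    index = rng.choices(range(len(candidates)), weights=probs, k=1)[0]
    return index, probs[index]


def toy_guidance(distance):
    """Hypothetical stand-in for an empirical end-to-end distance distribution."""
    return max(0.0, 1.0 - distance / 10.0)


candidates = [(0.0, 0.0, 0.0), (8.0, 0.0, 0.0)]
target = (2.0, 0.0, 0.0)  # position of the C-terminal anchor (made-up numbers)
probs = selection_probabilities(candidates, target, toy_guidance)
index, prob = pick_trial_state(candidates, target, toy_guidance)
```

Returning the selection probability alongside the chosen index is a design choice of this sketch: sequential Monte Carlo schemes typically need it to weight the resulting conformation.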

Our paper is organized as follows. We first present results for 
structure prediction using five different test data sets. We show that 
DiSGro has significant advantages in generating native-like loops. 
Accurate loops can be constructed by using DiSGro combined 
with a specifically designed atom-based distance-dependent 
empirical potential function. Our method is also computationally 
more efficient compared to previous methods [8,9,18,22,42]. We 
describe our model and the DiSGro sampling method in detail
at the end. 

Results 

Test set 

We use five data sets as our test sets. Test Set 1 contains 10
loops at each of lengths four, eight, and twelve, for a total of
3 × 10 = 30 loops from 21 PDB structures, which were described in
Table 2 of Ref. [8]. Test Set 2 consists of 53 eight-, 17 eleven-,
and 10 twelve-residue loops from Table C1 of Ref. [42]. Several
loop structures were removed as they were nine-residue loops
mislabeled as eight-residue loops (1awd, 55-63; 1byb, 246-254;
and 1ptf, 10-18). Altogether, there are 50 eight-residue loops. Test
Set 3 is a subset of that of [5], which was used in the RAPPER and
FALCm studies [10,22]. Details of this set can be found in the
"Fiser Benchmark Set" section of Ref. [10]. Test Set 4 is taken
from Tables A1-A6 of Ref. [42]. Test Set 5 contains 36 fourteen-,
30 fifteen-, 14 sixteen-, and 9 seventeen-residue loops from Table 3
of Ref. [23]. Test Sets 1 and 2 are used for testing the capability of
DiSGro and other methods in generating native-like loops. Test
Sets 3, 4, and 5 are used for assessing the accuracy of predicted
loops based on selection from energy evaluation using our
atom-based distance-dependent empirical potential function. Our
results are reported as global backbone RMSD, calculated using
the N, Cα, C, and O atoms of the backbone.
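
As an illustration of the reported measure, the following minimal sketch computes a backbone RMSD over the N, Cα, C, and O atoms. It assumes that "global" means the modeled and native loops are compared in the same coordinate frame without superimposing the loops themselves, which is the common usage but is not spelled out in this section. The function name and the dictionary-based residue layout are hypothetical.

```python
import math

BACKBONE_ATOMS = ("N", "CA", "C", "O")


def global_backbone_rmsd(model, native):
    """RMSD over backbone atoms; each loop is a list of {atom name: (x, y, z)}.

    No superposition is applied: both loops are assumed to share one frame.
    """
    if len(model) != len(native):
        raise ValueError("loops must have the same number of residues")
    squared_sum = 0.0
    atom_count = 0
    for res_model, res_native in zip(model, native):
        for name in BACKBONE_ATOMS:
            squared_sum += sum(
                (p - q) ** 2 for p, q in zip(res_model[name], res_native[name])
            )
            atom_count += 1
    return math.sqrt(squared_sum / atom_count)


# Made-up one-residue example: the model is the native shifted by 1 Å along x.
native = [{"N": (0.0, 0.0, 0.0), "CA": (1.5, 0.0, 0.0),
           "C": (2.0, 1.4, 0.0), "O": (1.3, 2.4, 0.0)}]
model = [{name: (x + 1.0, y, z) for name, (x, y, z) in native[0].items()}]
shift_rmsd = global_backbone_rmsd(model, native)
```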

Loop sampling 

To evaluate our method for producing native-like loop 
conformations, we use Test Set 1 and 2. 

We generate 5,000 loops for each of the 10 loop structures in 
Test Set 1 at length 4, 8, and 12 residues, respectively. We 
compare our results with those obtained by CCD [8], CSJD [12], 
SOS [18], and FALCm [22]. The minimum RMSD among 5,000 
sampled loops generated by DiSGro are listed in Table 1, along 
with results from the four other methods. 

Accurate loops of longer length are more difficult to generate.
For loops with 12 residues, DiSGro generates more accurate loops
than other methods. Our method has a mean of 1.53 Å for the
minimum RMSD, compared to 1.81 Å for FALCm, the next best
method in the group [22]. Nine of the ten 12-residue loops
generated by DiSGro have minimum RMSD < 2 Å, while five of the
ten generated by FALCm have RMSD > 2 Å. Compared to the CCD,
CSJD, and SOS methods, our loops have significantly smaller
minimum RMSD (1.53 Å vs 3.05, 2.34, and 2.25 Å, respectively;
Table 1). The average minimum global backbone RMSD for
12-residue loops can be further improved when we increase the
sample size of generated loop conformations. The minimum
global RMSD is improved to 1.45 Å, 1.26 Å, and 0.96 Å when
the sample size is increased to 20,000, 100,000, and 1,000,000,
respectively. Further improvement would likely require flexible
bond lengths and angles.

For loops with 8 residues, DiSGro has an average minimum
RMSD value smaller than the CCD, CSJD, and SOS methods
(0.80 Å vs 1.59 Å, 1.01 Å, and 1.19 Å, respectively; Table 1). In
eight of the ten 8-residue loops, DiSGro achieves sub-angstrom
accuracy (RMSD < 1 Å), although the mean of minimum RMSD
of 8-residue loops is slightly larger than that from FALCm (0.80 Å
vs 0.72 Å).

For 4-residue loops, the mean of the minimum RMSD
(0.21 Å) by DiSGro is significantly smaller than those by the
CSJD and the CCD methods (0.40 Å and 0.56 Å, respectively),
and is similar to those by the SOS and FALCm methods (0.20 Å
and 0.22 Å, respectively). Notably, three of the ten loops have
RMSD < 0.1 Å, indicating our sampling method has good
accuracy for short loop modeling.

These loops can be generated rapidly. The computing time per
conformation averaged over 5,000 conformations for 4-, 8-, and
12-residue loops is 4.4, 13, and 20 ms using a single AMD Opteron
processor of 2 GHz. In addition to improved average minimum
RMSD, DiSGro seems to take less time than CCD (31, 37, and
23 ms on an AMD 1800+ MP processor for the 4-, 8-, and
12-residue loops), and is as efficient as SOS (5.0, 13, and 19 ms for
the 4-, 8-, and 12-residue loops on an AMD 1800+ MP processor).
Table 1. Minimum backbone RMSD values (Å) of the loops sampled by five different algorithms.

Length   Loop         CCD    CSJD   SOS    FALCm  DiSGro
12-res   1cruA_358    2.54   2.00   2.39   2.07   1.84
         1ctqA_26     2.49   1.86   2.54   1.66   1.36
         1d4oA_88     2.33   1.60   2.44   0.82   1.50
         1d8wA_46     4.83   2.94   2.17   2.09   1.17
         1ds1A_282    3.04   3.10   2.33   2.10   1.82
         1dysA_291    2.48   3.04   2.08   1.67   1.45
         1eguA_508    2.14   2.82   2.36   1.71   2.13
         1f74A_11     2.72   1.53   2.23   1.44   1.46
         1qlwA_31     3.38   2.32   1.73   2.20   0.79
         1qopA_178    4.57   2.18   2.21   2.36   1.77
         Average      3.05   2.34   2.25   1.81   1.53
8-res    1cruA_85     1.75   0.99   1.48   0.62   1.34
         1ctqA_144    1.34   0.96   1.37   0.56   0.70
         1d8wA_334    1.51   0.37   1.18   0.96   0.93
         1ds1A_20     1.58   1.30   0.93   0.73   0.62
         1gk8A_122    1.68   1.29   0.96   0.62   1.08
         1i0hA_145    1.35   0.36   1.37   0.74   0.80
         1ixh_106     1.61   2.36   1.21   0.57   0.39
         1lam_420     1.60   0.83   0.90   0.66   0.63
         1qopB_14     1.85   0.69   1.24   0.92   0.87
         3chbD_51     1.66   0.96   1.23   1.03   0.67
         Average      1.59   1.01   1.19   0.72   0.80
4-res    1dvjA_20     0.61   0.38   0.23   0.39   0.31
         1dysA_47     0.68   0.37   0.16   0.20   0.09
         1eguA_404    0.68   0.36   0.16   0.22   0.39
         1ej0A_74     0.34   0.21   0.16   0.15   0.09
         1i0hA_123    0.62   0.26   0.22   0.17   0.13
         1id0A_405    0.67   0.72   0.33   0.19   0.33
         1qnrA_195    0.49   0.39   0.32   0.23   0.19
         1qopA_44     0.63   0.61   0.13   0.30   0.39
         1tca_95      0.39   0.28   0.15   0.09   0.11
         1thfD_121    0.50   0.36   0.11   0.21   0.05
         Average      0.56   0.40   0.20   0.22   0.21

Minimum backbone RMSD values of the loops sampled by CCD, CSJD, SOS, FALCm and DiSGro for different loop structures. CCD result was obtained from Table 2 of Ref. [8]. CSJD result was obtained from Table 1 of Ref. [12]. SOS result was obtained from Table 1 of Ref. [18]. FALCm result was obtained from Table 2 of Ref. [22].
doi:10.1371/journal.pcbi.1003539.t001
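
The summary statistics quoted for the 12-residue loops can be re-derived from the DiSGro and FALCm columns of Table 1. The short sketch below does this arithmetic; the helper names are hypothetical and the values are copied from the table.

```python
# Minimum RMSD values (Å) for the ten 12-residue loops, copied from Table 1.
disgro_12 = [1.84, 1.36, 1.50, 1.17, 1.82, 1.45, 2.13, 1.46, 0.79, 1.77]
falcm_12 = [2.07, 1.66, 0.82, 2.09, 2.10, 1.67, 1.71, 1.44, 2.20, 2.36]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def count_below(values, threshold):
    """Number of entries strictly below the threshold."""
    return sum(1 for v in values if v < threshold)


def count_above(values, threshold):
    """Number of entries strictly above the threshold."""
    return sum(1 for v in values if v > threshold)


disgro_mean = round(mean(disgro_12), 2)      # mean of minimum RMSD, DiSGro
falcm_mean = round(mean(falcm_12), 2)        # mean of minimum RMSD, FALCm
disgro_sub2 = count_below(disgro_12, 2.0)    # DiSGro loops with RMSD < 2 Å
falcm_over2 = count_above(falcm_12, 2.0)     # FALCm loops with RMSD > 2 Å
```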



Reducing the number of trial states in DiSGro can further
reduce the computing time, with some trade-off in sampling
accuracy. For example, when we take (m, n) = (10, 2), the
computing time per conformation averaged over 5,000
conformations for 4-, 8-, and 12-residue loops is only 3.5, 5.0, and
5.8 ms, respectively, with the average minimum RMSDs comparable
to those from SOS (0.29 Å vs 0.20 Å, 1.15 Å vs 1.19 Å, and 2.24 Å
vs 2.25 Å for the 4-, 8-, and 12-residue loops, respectively). Although
the CSJD loop closure method has faster computing time (0.56,
0.68, and 0.72 ms on an AMD 1800+ MP processor), the speed of
DiSGro is adequate in practical applications.

We compare DiSGro in generating near-native loops with
Wriggling [43], Random Tweak [44], Direct Tweak [42,45],
LOOPYbb [45], and PLOP-build [13] using Test Set 2. The
minimum RMSDs among 5,000 loops generated by DiSGro are
listed in Table 2, along with results from the other methods
obtained from Table 2 in Ref. [42]. Direct Tweak and LOOPYbb
from the LoopBuilder method and our DiSGro have better
accuracy in sampling than the Wriggling, Random Tweak, and
PLOP-build methods. For loops with 11 and 12 residues, these
three methods are the only ones that can generate near-native loop
structures with minimal RMSD values below 2 Å. Among these,
DiSGro outperforms LOOPYbb in generating loops at all three
lengths: the average minimal RMSD (Rmin) is 1.28 Å vs 1.80 Å for
length 12, 1.19 Å vs 1.51 Å for length 11, and 0.80 Å vs 0.89 Å
for length 8, respectively. Compared to the Direct Tweak sampling
method, DiSGro has improved Rmin for 12-residue loops (1.28 Å
vs 1.48 Å), slightly improved Rmin for 11-residue loops (1.19 Å vs
1.20 Å), and inferior Rmin for 8-residue loops (0.80 Å vs 0.69 Å).
Overall, these results show that DiSGro is very effective in
sampling near-native loop conformations, especially when
modeling longer loops of length 11 and 12.



PLOS Computational Biology | www.ploscompbiol.org | April 2014 | Volume 10 | Issue 4 | e1003539

Sampling and Structure Prediction of Protein Loops

Our DiSGro method can generate accurate loops and has
significant advantages for longer loops compared to previous
methods. Using RMSD values calculated from the three backbone
atoms N, Cα, and C for all loop lengths leads to the same
conclusion.

Loop structure prediction and energy evaluation 

To assess the accuracy of loops selected by our specifically 
designed atom-based distance-dependent empirical potential 
function, we test DiSGro using Test Set 3 and follow the
approach of reference [22] for ease of comparison. Because of the 
high content of secondary structures, these loops are very 
challenging to model. In the study of [22], 1,000 backbone 
conformations with the best scores evaluated by DFIRE potential 
function [46] were retained after screening 4,000 generated 
backbone conformations for each loop. Loop closure and steric 
clash removal were not enforced to the 4,000 conformations. We 
follow the same procedure, except the DFIRE potential function is 
replaced by our atom-based distance-dependent empirical poten- 
tial function. The ensemble of the selected 1,000 backbone 
conformations are then subjected to the procedure of side-chain 
construction as described in the Section "Side-chain modeling and 
steric clash removal". The loop conformations with full side-chains 
are then scored and ranked by the atom-based distance-dependent 
empirical potential function. Our results are summarized in 
Table 3. 

We measure the average minimum backbone RMSD Rmin, the
average ensemble RMSD Rave, and the average RMSD of the
lowest energy conformations REmin of the 1,000 loop ensemble
with the same length. Overall, DiSGro performs significantly
better than FALCm and RAPPER in Rmin, Rave, and REmin for all
loop lengths. Compared to FALCm, DiSGro shows significant
advantages in Rmin on sampling long loops of 10-12 residues. Our
method has Rmin of 1.15 Å compared to 1.45 Å for 10-residue
loops, 1.39 Å compared to 1.47 Å for 11-residue loops, and
1.53 Å compared to 1.74 Å for 12-residue loops, respectively. For
example, as can be seen in Figure 1, the lowest energy loop (red) of
a 12-residue loop in the protein 1scs (residues 199-210) has a
0.9 Å RMSD to the native structure (white). The generated top
five lowest energy loops are all very close to the native loop, yet are
diverse among themselves.

DiSGro also generates loops with smaller Rave compared to
FALCm in loops with length ranging from 4 to 12, indicating
DiSGro can generate ensembles of loop conformations
enriched with near-native conformations. Furthermore, DiSGro
achieves better modeling accuracy using the atom-based
distance-dependent empirical potential function. Compared to
FALCm, DiSGro has a REmin of 1.72 Å vs 1.87 Å for 8-residue
loops, 1.82 Å vs 2.08 Å for 9-residue loops, 2.33 Å vs 3.09 Å for
10-residue loops, 2.98 Å vs 3.43 Å for 11-residue loops, and
2.99 Å vs 3.84 Å for 12-residue loops, respectively.
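
The three per-loop quantities behind Rmin, Rave, and REmin can be sketched as follows, assuming each sampled conformation is summarized as an (RMSD, energy) pair; the table-level values are then averages of these quantities over all loops of the same length. The function names and the toy numbers are hypothetical.

```python
def ensemble_metrics(conformations):
    """Summarize one loop ensemble given as a list of (rmsd, energy) pairs."""
    rmsds = [rmsd for rmsd, _ in conformations]
    lowest_energy = min(conformations, key=lambda pair: pair[1])
    return {
        "min_rmsd": min(rmsds),               # best-sampled conformation
        "ave_rmsd": sum(rmsds) / len(rmsds),  # ensemble average
        "emin_rmsd": lowest_energy[0],        # RMSD of the lowest energy one
    }


def average_over_loops(per_loop_metrics, key):
    """Average one metric over all loops of the same length."""
    return sum(m[key] for m in per_loop_metrics) / len(per_loop_metrics)


# Made-up ensembles for two loops of the same length.
loop_a = ensemble_metrics([(1.0, -5.0), (3.0, -9.0), (2.0, -1.0)])
loop_b = ensemble_metrics([(2.0, -4.0), (4.0, -2.0)])
r_emin = average_over_loops([loop_a, loop_b], "emin_rmsd")
```

The toy ensemble for `loop_a` shows why REmin can exceed Rmin: the lowest energy conformation need not be the one closest to the native structure.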

DiSGro is also much faster than other methods. The reported 
typical computational cost of FALCm is 180 cpu minutes for 8-12 
residue loops on a Linux server of a 2.8 GHz 2-core Intel Xeon 
processor [47]. The computational cost of the DiSGro method is only
6 and 10 cpu minutes for 10 and 12-residue loops on a single 
2 GHz AMD Opteron processor, respectively. In addition, 
FALCm has a size restriction, and it only works with proteins 
with < 500 residues. In contrast, the overall protein size has no 
effect on the computational efficiency of DiSGro since the 
numbers of atoms for energy calculation that are retained by the 
ellipsoid criterion are bounded. 

The LOOPER method is an accurate and efficient loop
modeling method using a minimal conformational sampling
method combined with energy minimization [17]. The test set
used in the LOOPER study is the original Fiser data set without
removal of any loops. Therefore, it is different from Test Set 3
used in the RAPPER and FALCm studies [10,22]. For ease of
comparison, we compare DiSGro to LOOPER using the
test set with 10-12-residue loops from [17]. Our results are
summarized in Table 4.

We denote RBkb,ave and RBkb,med as the mean and median of the
backbone RMSD of the lowest energy conformations with the
same loop length. Similarly, we use RAtm,ave and RAtm,med to
denote the mean and median RMSD values of all heavy atoms.
DiSGro shows improved prediction accuracy compared to
LOOPER in both backbone and all-heavy atom RMSD. For
the 40 loops of length 12, RBkb,ave is 3.20 Å compared to 4.08 Å,
while the median RBkb,med is 2.39 Å compared to 3.80 Å. It also
has better all-heavy atom RMSD of 3.39 Å/3.18 Å (mean/
median), compared to 3.58 Å/3.35 Å for 10-residue loops,
3.58 Å/3.30 Å compared to 4.30 Å/3.60 Å for 11-residue loops,
and 4.18 Å/3.60 Å compared to 5.22 Å/4.96 Å for 12-residue
loops.

It is worth noting that DiSGro outperforms LOOPER in speed 
as well. For a loop with 10 residues, the time cost of DiSGro is 6 
minutes using a 2 GHz CPU versus 40 cpu minutes using a 3 GHz 
processor according to Figure 7 in the LOOPER paper [17]. 

Prior publications also allowed us to compare results in loop
structure predictions based on energy discrimination using Test
Set 4 with results obtained using the LoopBuilder method [42].
Following [42], we generated 1,000 closed loop conformations for
eight-residue loops, 2,000 for nine-residue loops, 5,000 for ten-,
eleven-, and twelve-residue loops, and 8,000 for thirteen-residue
loops. Energy calculations are carried out using our atom-based
distance-dependent empirical potential function. The average
RMSD of the lowest energy conformations, REmin, is then
compared between these two methods. The results are
summarized in Table 5.

Compared to LoopBuilder, DiSGro has better REmin: 1.83 Å vs
1.88 Å for 9-residue loops, 1.83 Å vs 1.93 Å for 10-residue loops,
2.38 Å vs 2.50 Å for 11-residue loops, 2.62 Å vs 2.65 Å for
12-residue loops, and 3.26 Å vs 3.74 Å for 13-residue loops,
respectively. DiSGro has inferior performance in REmin
for 8-residue loops (1.59 Å vs 1.31 Å). The average time using
LoopBuilder for twelve-residue loops was around 4.5 hours or
270 minutes, while the computational time using DiSGro is
around 10 minutes. Overall, DiSGro has equal or slightly better
performance than LoopBuilder in average prediction accuracy of
loop structures with far less computing time.

To test the feasibility of DiSGro in modeling longer loops with
length > 12, we use the Fiser 13-residue loop data set to generate
and select low energy loop conformations. 1,000 conformations
with low energy are obtained. The mean of the minimum backbone
RMSD Rmin of the 40 13-residue loops is 1.76 Å, and the median
is 1.61 Å. The mean/median of the backbone RMSD RBkb,Emin
and all-heavy atom RMSD RAtm,Emin of the lowest energy
conformations are 2.91 Å/2.53 Å and 3.84 Å/3.29 Å,
respectively (Table 6).

With extensive conformational sampling using a molecular
mechanics force field, the Protein Local Optimization Program
(PLOP) can predict highly accurate loops [13,14,23]. We tested
DiSGro using Test Set 5 consisting of 89 loops with length 14-17
and compared results with those using PLOP. Here the sampling
and scoring processes were similar to those used for Test Set 3,
except 100,000 backbone conformations were generated. We
measured the average minimum backbone RMSD Rmin and the
average RMSD of the lowest energy conformations REmin. Our
results are summarized in Table 7.

Figure 1. Top five lowest energy loops of length 12 for single-metal-substituted concanavalin A (pdb 1scs, residues 199-210). The
lowest energy loop after side-chain construction is colored in red, and the native structure is in white.
doi:10.1371/journal.pcbi.1003539.g001

Table 4. Comparison of accuracy of modeled loops using the original Fiser data set of loops with 10-12 residues.

                   DiSGro/LOOPER (Å)
Length   Targets   RBkb,ave    RBkb,med    RAtm,ave    RAtm,med
10       40        2.30/2.66   2.20/2.39   3.39/3.58   3.18/3.35
11       40        2.63/3.35   2.25/2.76   3.58/4.30   3.30/3.60
12       40        3.20/4.08   2.39/3.80   4.18/5.22   3.60/4.96

The accuracy achieved by LOOPER and DiSGro at different loop lengths using the original Fiser data set of loops with 10-12 residues is listed. RBkb,ave and RBkb,med
denote the mean and median of backbone RMSD, while RAtm,ave and RAtm,med denote the mean and median of all-heavy atom RMSD of the lowest energy
conformations with the same loop length.
doi:10.1371/journal.pcbi.1003539.t004

Loops predicted by the PLOP method have smaller REmin
compared to DiSGro [23], although DiSGro samples well and
gives small Rmin of 1.58 Å for 14-residue loops, 1.80 Å for
15-residue loops, 1.88 Å for 16-residue loops, and 2.18 Å for
17-residue loops. For loops of length 17, the Rmin of 2.18 Å is less
than the reported REmin = 2.30 Å using PLOP, although it is
unclear whether the Rmin of loops generated by PLOP is less than
2.18 Å. Overall, DiSGro is capable of successfully generating high
quality near-native long loops, up to length 17. The accuracy of
REmin of loops generated by DiSGro may be further improved by
using a more effective scoring function.

We also compared the computational costs of the two methods. 
The average computing time for DiSGro is 0.73, 0.72, 0.81, and 
0.95 hours for loops of lengths 14, 15, 16, and 17 using a single 
core AMD Opteron processor 2350, respectively, which is more 
than two orders of magnitude less than the time required for the 
PLOP method (216.0, 309.6, 278.4, and 408.0 hours for loops of 
length 14, 15, 16, and 17 residues, respectively). 
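
The "more than two orders of magnitude" statement can be checked directly from the computing times quoted above. The sketch below only does this arithmetic; since the two sets of timings come from different studies and possibly different hardware, the ratios should be read as indicative rather than as a controlled benchmark.

```python
# Average computing time in hours per loop length, as quoted in the text.
disgro_hours = {14: 0.73, 15: 0.72, 16: 0.81, 17: 0.95}
plop_hours = {14: 216.0, 15: 309.6, 16: 278.4, 17: 408.0}

# Ratio of PLOP time to DiSGro time for each loop length.
speedup = {n: plop_hours[n] / disgro_hours[n] for n in disgro_hours}

for length in sorted(speedup):
    print(f"{length}-residue loops: about {speedup[length]:.0f}-fold")
```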

Improvement in computational efficiency 

We used a REsidue-residue Distance Cutoff and ELLipsoid
criterion (Redcell) to improve the computational efficiency. To
assess the effectiveness of this approach, we carry out a test using a
set of 140 proteins (see discussion of the tuning set in Materials and
Methods). We compared the time cost of energy calculation of
generating a single loop, with and without this procedure. When
the procedure is applied, we only calculate the pairwise atom-atom
distance energy between atoms in loop residues and other atoms
within the ellipsoid. When the procedure is not applied, we
calculate the energy function between atoms in loop residues and
all other atoms in the rest of the protein. The computational cost of
energy calculations for sampling single loops with 12 and 6
residues is shown in Figure 2A and Figure 2B, respectively.

From Figure 2, we can see that significant improvement in
computational cost is achieved. The average time cost using our
procedure is reduced from 82.3 ms to 6.0 ms for sampling
12-residue loops, and from 39.4 ms to 2.0 ms for 6-residue loops. In
addition, this approach makes the time cost of energy calculations
independent of the protein size (Figure 2A and Figure 2B), whereas
the computing time without applying this procedure increases linearly
with the protein size. The improvement is especially significant for
large proteins. For example, to generate a 15-residue loop in a
protein with 1,114 residues, the computing time is improved from
93.7 ms to 1.8 ms, which is more than a 50-fold speed-up. Detailed
examination indicates that both the distance cutoff and the ellipsoid
criterion contribute to the computational efficiency. Furthermore,
the full Redcell procedure has improved efficiency over using either
"Ellipsoid Criterion Only" or "Cutoff Criterion Only". The
computing time for generating a 15-residue loop is 2.0 ms when
the full Redcell procedure is applied, compared to 5.3 ms and
3.9 ms when only the ellipsoid criterion and only the distance
threshold are used, respectively (Figure 2C). Furthermore, there is
no loss of accuracy in energy evaluation. Overall, Redcell reduces
the computational cost by excluding many atoms from collision
detection and energy calculations, with significant reduction in
computation time, especially for large proteins.
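
The idea behind an ellipsoid pre-filter can be illustrated with a minimal sketch. The exact Redcell construction is not given in this section, so the form below is an assumption: the two loop anchors are taken as the foci, and an environment atom is kept if the sum of its distances to the foci does not exceed the maximal loop span plus twice the interaction cutoff. Under that assumption the filter is safe, because any loop atom lies within the maximal span of both anchors combined, and any atom within the cutoff of a loop atom adds at most one cutoff to each of the two distances. All names and numbers are hypothetical.

```python
import math


def in_ellipsoid(point, focus_a, focus_b, major_axis):
    """True if the summed distances to the two foci do not exceed major_axis."""
    return math.dist(point, focus_a) + math.dist(point, focus_b) <= major_axis


def select_environment_atoms(atoms, focus_a, focus_b, max_loop_span, cutoff):
    """Keep only atoms that could lie within `cutoff` of some loop atom.

    Assumed form of the criterion: foci at the loop anchors, major axis
    equal to the maximal loop span enlarged by twice the cutoff.
    """
    major_axis = max_loop_span + 2.0 * cutoff
    return [a for a in atoms if in_ellipsoid(a, focus_a, focus_b, major_axis)]


anchor_n = (0.0, 0.0, 0.0)    # N-terminal anchor (made-up coordinates, Å)
anchor_c = (10.0, 0.0, 0.0)   # C-terminal anchor
atoms = [(5.0, 0.0, 0.0), (25.0, 0.0, 0.0), (100.0, 0.0, 0.0)]
kept = select_environment_atoms(atoms, anchor_n, anchor_c,
                                max_loop_span=20.0, cutoff=5.0)
```

Because the number of atoms that fit inside such an ellipsoid is bounded by its volume, the cost of the subsequent pairwise energy terms no longer grows with the size of the protein, which is consistent with the behavior reported above.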

Table 5. Comparison of REmin of the loop conformations sampled by LoopBuilder and DiSGro using Test Set 4 taken from the
LoopBuilder study [42].

                          Average prediction accuracy (REmin, Å)
Length   # of Targets     LoopBuilder   DiSGro
8        63               1.31          1.59
9        56               1.88          1.83
10       40               1.93          1.83
11       54               2.50          2.38
12       40               2.65          2.62
13       40               3.74          3.26

REmin denotes the average RMSD of the lowest energy conformations of the loop ensemble. Results of LoopBuilder were obtained from Table 5 of Ref. [42].
doi:10.1371/journal.pcbi.1003539.t005

Discussion

In this study, we presented a novel method, Distance-guided
Sequential chain-Growth Monte Carlo (DiSGro), for generating
protein loop conformations and predicting loop structures.
Ensembles of near-native loop conformations can be efficiently
generated using the DiSGro method. DiSGro has better average
minimum backbone RMSD, Rmin, compared to other loop
sampling methods. For example, Rmin is 1.53 Å for 12-residue
loops when using DiSGro, while the corresponding values are
3.05 Å, 2.34 Å, 2.25 Å, and 1.81 Å when using the CCD, CSJD,
SOS, and FALCm methods, respectively.

DiSGro also performs well in identifying native-like
conformations using the atom-based distance-dependent empirical
potential function. In comparison with other similar loop modeling
methods, DiSGro demonstrated improved modeling accuracy,
in terms of the average RMSD of the lowest energy conformations
REmin, for the more challenging task of sampling longer loops of
10-13 residues. For example, DiSGro outperforms FALCm [22]
(2.33 Å vs 3.09 Å) and LOOPER [17] (2.30 Å vs 2.66 Å) in
predicting 10-residue loops, while taking less computing time
(6 minutes vs 180 minutes for FALCm and 40 minutes for
LOOPER). Compared to LoopBuilder [42], DiSGro also has
better REmin. For 13-residue loops, the REmin is 3.26 Å using
DiSGro, but is 3.74 Å when using LoopBuilder. The average
computing time is also shorter when using DiSGro: it takes about 6
minutes to predict structures of 10-residue loops and 10 minutes
for 12-residue loops. DiSGro also works well for short loops,
although this may be largely a reflection of the underlying
analytical closure method [12].

There are a number of directions for further improvement.
DiSGro can be further improved by adding fragments of peptides
when growing loops instead of adding individual residues.
Fragment-based approaches have been widely used in protein
structure prediction [48-51] and specifically in loop structure
prediction [21]. It is straightforward to apply the strategy
described in this study to fragment-based growth, and this will
likely improve sampling efficiency further and enable
longer loops to be modeled. Furthermore, the energy function
employed here can be further improved by optimization, such as
that obtained by training with challenging decoy loops using a
nonlinear kernel [52], and/or using rapid iterations through a
physical convergence function [53,54]. In addition, DiSGro is
compatible with different loop closure methods [8,12,22], and
experimenting with other closure strategies may also lead to further
improvement.

An efficient loop sampling method such as DiSGro can help to
improve overall modeling of loop structures. Currently, the
hierarchical approach of the Protein Local Optimization Program
(PLOP) [13,14,23] gives excellent accuracy in protein loop
modeling, but requires significant computational time. The
average time cost of modeling a 13-residue loop is about 4-5
days [23]. The kinematic closure (KIC) method can also make very
accurate predictions of 12-residue loops [21]. However, KIC also
requires substantial computation, with about 320 CPU hours on a
single 2.2 GHz Opteron processor for predicting 12-residue loops
[21]. As suggested earlier by Spassov et al [17], an efficient loop
modeling method combined with energy minimization may
overcome the obstacle of high computational cost. By generating
high quality initial structures using DiSGro, near-native
conformations of loops can be used as candidates for further refinement.

Materials and Methods 

Protein structures representation 

All heavy atoms in the backbone and side chain of a protein 
loop are explicitly modeled. The bond lengths $b$ and angles $\theta$ are
taken from standard values specific to residue and atom type [55].
The backbone dihedral angles $(\phi, \psi, \omega)$ and side chain dihedral
angles $\chi$ constitute all the degrees of freedom (DOFs) in our model.

Distance-guided Sequential chain-Growth Monte Carlo 
(DiSGro) 

In order to efficiently generate an adequate number of native-like
loop conformations, we have developed a Distance-guided
Sequential chain-Growth Monte Carlo (DiSGro) method.

Let the loop to be modeled begin at residue $t$ and end at
residue $l$. The positions of the backbone heavy atoms
from the C atom of residue $t$ to the C$\alpha$ (CA) atom of residue $l$ are
unknown and need to be generated. We assume that the backbone
atoms before and after this fragment are known. Coordinates of
side chain atoms are also unknown and need to be generated if the
coordinates of the CA atoms they are attached to are unknown.

At each step of the chain growth process, we generate three
consecutive backbone atoms continuing from the backbone atom
sampled at the previous step. At the $(i-t)$-th growth step ($t < i < l$),
the three backbone atoms are the C atom of residue $i$, the N atom of
residue $i+1$, and the CA atom of residue $i+1$ (Figure 3). The
coordinates of the three atoms, $C_i$, $N_{i+1}$ and $CA_{i+1}$, are denoted
as $\mathbf{x}_{C,i}$, $\mathbf{x}_{N,i+1}$, and $\mathbf{x}_{CA,i+1}$, respectively. The $\omega$ dihedral angles
that determine the coordinates of the C$\alpha$ atoms are sampled from a
normal distribution with mean 180° and standard deviation 4°. In
the next section, we describe in detail the sampling of the dihedral
angles $(\phi, \psi)$, which determine the coordinates of the C and the N
atoms.

Sampling backbone $(\phi, \psi)$ angles. Without loss of generality,
we describe the sampling procedure for the $C_i$ and $N_{i+1}$ atoms at
the $(i-t)$-th growth step. $C_i$ is generated first, followed by $N_{i+1}$.
Denote the distance between $\mathbf{x}_{CA,i}$ and $\mathbf{x}_{C,l}$ as
$d_{CA_i,C_l} = |\mathbf{x}_{C,l} - \mathbf{x}_{CA,i}|$, and the distance between $\mathbf{x}_{C,i}$ and $\mathbf{x}_{C,l}$ as
$d_{C_i,C_l} = |\mathbf{x}_{C,i} - \mathbf{x}_{C,l}|$. Since the bond angle $\theta_{C,i}$ formed by the
$N_i - CA_i$ and $CA_i - C_i$ bonds is fixed, and the bond length $b_{CA_i,C_i}$
is also fixed, $C_i$ will be located on a circle $\mathcal{C}_C$ (Figure 3):

$$\mathcal{C}_C = \{\mathbf{x} \in \mathbb{R}^3 \mid \|\mathbf{x} - \mathbf{x}_{CA,i}\| = b_{CA_i,C_i} \ \text{and} \ (\mathbf{x} - \mathbf{x}_{CA,i}) \cdot (\mathbf{x}_{CA,i} - \mathbf{x}_{N,i}) = \cos\theta_{C,i}\}. \tag{1}$$

Given a fixed $d_{C_i,C_l}$, $C_i$ can be placed on two positions $\mathbf{x}_{C,i}$ and
$\mathbf{x}_{C',i}$ on circle $\mathcal{C}_C$ (in Figure 3, $\mathbf{x}_{C,i}$ and $\mathbf{x}_{C',i}$ are labeled as $C_i$ and
$C'_i$, respectively). As the probability of placing $C_i$ on either
position is about equal based on our analysis, we randomly select
one position to place atom $C_i$.
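The placement step above can be illustrated with a minimal geometric sketch. It is not the DiSGro source code: the function names (`bond_circle`, `positions_at_distance`, `place_atom`) are hypothetical, and the circle is built from the usual bond-angle convention (the angle between the two bonds meeting at the middle atom). The sketch intersects the circle with a sphere of radius $d$ around the target atom, which gives at most two candidate positions, and picks one at random.

```python
import math
import random

def sub(a, b):
    return [a[k] - b[k] for k in range(3)]

def add(a, b):
    return [a[k] + b[k] for k in range(3)]

def scale(a, s):
    return [s * a[k] for k in range(3)]

def dot(a, b):
    return sum(a[k] * b[k] for k in range(3))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def unit(a):
    return scale(a, 1.0 / math.sqrt(dot(a, a)))

def bond_circle(prev2, prev1, bond_length, bond_angle):
    """Circle of allowed positions for a new atom bonded to prev1.

    bond_angle (radians) is the angle prev2-prev1-new.
    Returns (center, radius, u, v); u and v span the circle plane.
    """
    axis = unit(sub(prev1, prev2))
    center = add(prev1, scale(axis, -bond_length * math.cos(bond_angle)))
    radius = bond_length * math.sin(bond_angle)
    # Any helper vector that is not parallel to the axis will do.
    helper = [0.0, 1.0, 0.0] if abs(axis[0]) > 0.9 else [1.0, 0.0, 0.0]
    u = unit(cross(axis, helper))
    v = cross(axis, u)
    return center, radius, u, v

def positions_at_distance(circle, target, d):
    """Points on the circle whose distance to target equals d (0 or 2 points)."""
    center, radius, u, v = circle
    w = sub(center, target)
    a_coef, b_coef = dot(w, u), dot(w, v)
    # |p - target|^2 = d^2  <=>  a_coef*cos(t) + b_coef*sin(t) = k_val
    k_val = (d * d - dot(w, w) - radius * radius) / (2.0 * radius)
    amplitude = math.hypot(a_coef, b_coef)
    if amplitude == 0.0 or abs(k_val) > amplitude:
        return []
    base = math.atan2(b_coef, a_coef)
    delta = math.acos(k_val / amplitude)
    points = []
    for t in (base + delta, base - delta):
        offset = add(scale(u, radius * math.cos(t)), scale(v, radius * math.sin(t)))
        points.append(add(center, offset))
    return points

def place_atom(circle, target, d, rng=random):
    """Pick one of the two candidate positions with equal probability."""
    candidates = positions_at_distance(circle, target, d)
    return rng.choice(candidates) if candidates else None
```

When the sampled distance cannot be reached from the circle, the sketch returns `None`. How such a trial is handled (rejected or resampled) is not specified here and is left to the caller.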

In principle, sampling from the empirical distributions of $d_{C_i,C_l}$
and mapping back to $C_i$ should encourage the growth of loops to
connect to the terminal $C_l$ atom. Further analysis of the empirical
distribution of $d_{C_i,C_l}$ given $d_{CA_i,C_l}$ shows that $d_{CA_i,C_l}$ can be very
informative for sampling $d_{C_i,C_l}$ in some cases. This led us to
design the sampling of $\mathbf{x}_{C,i}$ based on the conditional distribution
$\pi(d_{C_i,C_l} \mid d_{CA_i,C_l})$. See below for details.

Generating atom $N_{i+1}$ is similar to generating $C_i$, except that $N_{i+1}$
instead of $C_i$ is placed on a circle $\mathcal{C}_N$:

$$\mathcal{C}_N = \{\mathbf{x} \in \mathbb{R}^3 \mid \|\mathbf{x} - \mathbf{x}_{C,i}\| = b_{C_i,N_{i+1}} \ \text{and} \ (\mathbf{x} - \mathbf{x}_{C,i}) \cdot (\mathbf{x}_{C,i} - \mathbf{x}_{CA,i}) = \cos\theta_{N,i+1}\}, \tag{2}$$

where $b_{C_i,N_{i+1}}$ is the bond length between atom $C_i$ and atom
$N_{i+1}$, and the distance between $\mathbf{x}_{N,i+1}$ and $\mathbf{x}_{C,l}$ is
$d_{N_{i+1},C_l} = |\mathbf{x}_{N,i+1} - \mathbf{x}_{C,l}|$. Similarly, atom $N_{i+1}$ is placed by
sampling $d_{N_{i+1},C_l}$ conditioned on $d_{C_i,C_l}$ from the empirical
conditional density $\pi(d_{N_{i+1},C_l} \mid d_{C_i,C_l})$. We repeat this process $m$
times to generate $m$ trial positions of $C_i$, $N_{i+1}$, and $CA_{i+1}$.

Figure 2. The time cost of energy calculations for generating one single loop. (A) The plot of computing time versus protein size shows a
large time saving of "Redcell-On" (red solid curve) compared to "Redcell-Off" (black dashed curve) for 12-residue loops, and (B) the plot for 6-residue
loops. (C) The plot of computing time versus protein size shows that "Redcell-On" (red solid curve) has a significantly lower computational time cost
compared to "Ellipsoid-Only" (black dashed curve) and "Cutoff-Only" (green solid curve).
doi:10.1371/journal.pcbi.1003539.g002

Sampling $d_{C_i,C_l}$ and $d_{N_{i+1},C_l}$ from conditional distributions.
We sample $d_{C_i,C_l}$ from the conditional distribution
$\pi(d_{C_i,C_l} \mid d_{CA_i,C_l})$ to obtain the location of the $C_i$ atom. We first
construct the empirical joint distribution $\pi(d_{CA_i,C_l}, d_{C_i,C_l})$ by
collecting $(d_{CA_i,C_l}, d_{C_i,C_l})$ pairs over all loops in a loop database
derived from the CulledPDB database (version 11118, at 30%
identity, 2.0 Å resolution, and with R = 0.25) [56]. From the 6,521
protein structures in the CulledPDB, we remove 7 PDB structures
that appear in our test data set. For the remaining 6,514 protein
structures, loop regions were identified using the secondary
structure information either directly from the PDB records or
from the classification provided by the DSSP software [57]. All
random coil regions, including $\alpha$-helices and $\beta$-strands with length
<4 amino acids, are included in our database. In total, we have
49,336 loop structures.

For each set of loops with the same residue separation $(l - i)$,
the $(d_{CA_i,C_l}, d_{C_i,C_l})$ pairs are Winsorised at the 99.9% level [58]. Specifically, the
extreme values above 99.9% are replaced by the values at the 99.9th
percentile. We then use a nonparametric two-dimensional
Gaussian kernel density estimator to construct a smooth bivariate
distribution $\pi(d_{CA_i,C_l}, d_{C_i,C_l})$ based on the collected data. To estimate
the probability density at a point $\mathbf{u} = (d_{CA_i,C_l}, d_{C_i,C_l}) \in \mathbb{R}^2$, we use
the observed $n$ pairs of data from the database
$(\mathbf{x}_1, \ldots, \mathbf{x}_n) = ((d_{CA_i,C_l,1}, d_{C_i,C_l,1}), \ldots, (d_{CA_i,C_l,n}, d_{C_i,C_l,n}))$ to derive
the density function $\pi(\mathbf{u})$, which takes the form:

$$\pi(\mathbf{u}) = \frac{1}{n} \sum_{i=1}^{n} |H|^{-1/2} K[H^{-1/2}(\mathbf{u} - \mathbf{x}_i)], \tag{3}$$

where $H$ is the symmetric and positive definite $2 \times 2$ bandwidth
matrix, and $K$ is a bivariate Gaussian kernel function:

$$K(\mathbf{x}) = \frac{1}{2\pi} \exp\left(-\frac{1}{2} \mathbf{x}^{T} \mathbf{x}\right). \tag{4}$$

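To construct the bandwidth matrix $H$, we calculate the
standard deviation $\sigma_{d_{CA_i,C_l}}$ of the $n$ pairs of $(d_{CA_i,C_l}, d_{C_i,C_l})$. The
corresponding entry $h_{d_{CA_i,C_l}}$ in the bandwidth matrix $H$ is set as
$\sigma_{d_{CA_i,C_l}} (1/n)^{1/6}$. Similarly, $h_{d_{C_i,C_l}} = \sigma_{d_{C_i,C_l}} (1/n)^{1/6}$.
The bandwidth matrix $H$ is then assembled as [59]:

$$H = \begin{pmatrix} h^2_{d_{CA_i,C_l}} & 0 \\ 0 & h^2_{d_{C_i,C_l}} \end{pmatrix}. \tag{5}$$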
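As an illustration of the estimator, the following sketch evaluates a bivariate Gaussian kernel density with a diagonal bandwidth matrix. It is a sketch under stated assumptions rather than the authors' implementation: the function names are hypothetical, and the bandwidth rule $h = \sigma n^{-1/6}$ used in `bandwidths` is the common normal-scale rule for two dimensions, adopted here as an assumption. With $H = \mathrm{diag}(h_1^2, h_2^2)$, the factor $|H|^{-1/2}$ reduces to $1/(h_1 h_2)$ and $H^{-1/2}(\mathbf{u} - \mathbf{x}_i)$ to a per-coordinate rescaling.

```python
import math
import statistics

def bandwidths(pairs):
    """Per-coordinate bandwidths h = sigma * n**(-1/6) (assumed rule)."""
    n = len(pairs)
    sigma1 = statistics.stdev([p[0] for p in pairs])
    sigma2 = statistics.stdev([p[1] for p in pairs])
    factor = n ** (-1.0 / 6.0)
    return sigma1 * factor, sigma2 * factor

def kde2d(u, pairs, h1, h2):
    """Density at point u from a bivariate Gaussian kernel, H = diag(h1^2, h2^2)."""
    total = 0.0
    for x1, x2 in pairs:
        z1 = (u[0] - x1) / h1
        z2 = (u[1] - x2) / h2
        total += math.exp(-0.5 * (z1 * z1 + z2 * z2))
    # 1/(2*pi) is the kernel constant; 1/(h1*h2) is |H|^(-1/2).
    return total / (2.0 * math.pi * h1 * h2 * len(pairs))
```

In practice the density would be evaluated once on a grid and reused, rather than recomputed at every growth step.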



We partition the domain of $(d_{CA_i,C_l}, d_{C_i,C_l})$ into a grid with 32 grid
points in each direction. $\pi(d_{CA_i,C_l}, d_{C_i,C_l})$ is estimated at the grid
points, and interpolated by a bilinear function elsewhere.
The conditional distribution $\pi(d_{C_i,C_l} \mid d_{CA_i,C_l})$ is constructed from the
joint distribution $\pi(d_{CA_i,C_l}, d_{C_i,C_l})$ when $d_{CA_i,C_l}$ is fixed. $d_{C_i,C_l}$ is
sampled from $\pi(d_{C_i,C_l} \mid d_{CA_i,C_l})$. We follow the same procedure to
construct $\pi(d_{N_{i+1},C_l} \mid d_{C_i,C_l})$, which is used to sample $d_{N_{i+1},C_l}$.
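A simplified sketch of drawing from the conditional distribution on a grid is given below. The names (`conditional_weights`, `sample_conditional`) and the data layout (`joint[i][j]` holding the density at `xs[i]`, `ys[j]`) are assumptions made for illustration. For a fixed first distance, bilinear interpolation evaluated at the grid values of the second distance reduces to linear interpolation between two neighbouring rows. The sketch then draws one grid value with probability proportional to the interpolated density, which is a discrete approximation of sampling from the continuous conditional.

```python
import bisect
import random

def conditional_weights(xs, joint, x):
    """Normalized conditional weights over the second axis, given first-axis value x."""
    x = min(max(x, xs[0]), xs[-1])          # clamp to the grid domain
    i = min(bisect.bisect_right(xs, x) - 1, len(xs) - 2)
    f = (x - xs[i]) / (xs[i + 1] - xs[i])   # fractional position between rows
    row = [(1.0 - f) * lo + f * hi for lo, hi in zip(joint[i], joint[i + 1])]
    total = sum(row)
    if total <= 0.0:
        raise ValueError("no probability mass at this conditioning value")
    return [w / total for w in row]

def sample_conditional(xs, ys, joint, x, rng=random):
    """Draw one grid value of the second distance, conditioned on x."""
    weights = conditional_weights(xs, joint, x)
    return rng.choices(ys, weights=weights, k=1)[0]
```

A finer draw could interpolate within the selected grid cell as well; that refinement is omitted to keep the example short.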

Backbone dihedral angle distributions from the loop
database. Although the empirical conditional distributions can
efficiently guide chain growth to generate properly connected loop
conformations, the dihedral angles of the loops are often not
energetically favorable. As a result, the conditional distributions
described above are not sufficient on their own for generating near-native
loop conformations.

The problem can be alleviated by an additional step of selecting
a subset of $n$ loops with low-energy dihedral angles from the generated
samples. We use empirical distributions of the loop dihedral angles
obtained from the loop database. Specifically, for the $m$ sampled
positions of the current residue $i$ of type $a_i$, with dihedral angles
$(\phi_i, \psi_i)$, we select $n < m$ samples following an empirically
derived backbone dihedral angle distribution $\pi(\phi_i, \psi_i, a_i)$.
Here $\pi(\phi_i, \psi_i, a_i)$ is derived from the same protein loop structure
database used for the conditional distance distributions and is constructed by
counting the frequencies of $(\phi, \psi)$ pairs for each residue type.
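The selection of $n$ out of $m$ trials can be sketched as a weighted draw from a frequency table. This is an illustrative simplification: the 10° bin width, the function names, and the use of sampling with replacement are all assumptions, since the binning and selection details are not given in the text.

```python
import random
from collections import Counter

BIN_WIDTH = 10.0                     # hypothetical bin width in degrees
N_BINS = int(360.0 / BIN_WIDTH)

def dihedral_bin(phi, psi):
    """Map (phi, psi) in degrees, range [-180, 180], to a pair of bin indices."""
    return (int((phi + 180.0) // BIN_WIDTH) % N_BINS,
            int((psi + 180.0) // BIN_WIDTH) % N_BINS)

def build_table(observed):
    """Count (residue type, phi bin, psi bin) frequencies from database entries."""
    return Counter((res,) + dihedral_bin(phi, psi) for res, phi, psi in observed)

def select_trials(trials, residue_type, table, n, rng=random):
    """Pick n of the m trial (phi, psi) pairs, weighted by empirical frequency."""
    weights = [table[(residue_type,) + dihedral_bin(phi, psi)] for phi, psi in trials]
    if sum(weights) == 0:
        weights = [1] * len(trials)  # fall back to uniform if nothing was observed
    return rng.choices(trials, weights=weights, k=n)
```

Trials that fall in bins never observed for the residue type receive zero weight, so they are never selected unless all trials are unobserved.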

Determining the number of trial states at each growth
step for backbone torsion angles. It is important to
determine the appropriate numbers of trial states $m$ and $n$ for
generating backbone conformations, as small $m$ and $n$ values
may lead to insufficient sampling, resulting in inaccurate loop
conformations. On the other hand, very large $m$ and $n$ values will
require significantly more computational time, without significant
gain in accuracy.

We use a data set, denoted as tuning-set, to determine the optimal
values of parameters $m$ and $n$ for sampling backbone conformations.
Part of this data set comes from that of Soto et al [42]. The
rest are randomly selected from the pre-compiled CulledPDB (with
<20% sequence identity, <1.8 Å resolution, and R<0.25). It
contains a total of 140 loops, with 35 loops of length 6, 35 of length
8, 35 of length 10, and 35 of length 12.

The optimal values of $m$ and $n$ are determined as
($m = 160$, $n = 32$) according to the test results on tuning-set
(Figure 4).

Figure 3. Schematic illustration of placing the $C_i$ and $N_{i+1}$ atoms. Atom $C_i$ has to be on the circle $\mathcal{C}_C$. The position $\mathbf{x}_{C,i}$ of the $C_i$ atom of residue
$i$ is determined by $d_{C_i,C_l}$, which is based on the known distance $d_{CA_i,C_l}$ and the conditional distribution $\pi(d_{C_i,C_l} \mid d_{CA_i,C_l})$. Once $d_{C_i,C_l}$ is sampled, $C_i$ can
be placed on two positions with equal probabilities. Here $\mathbf{x}_{C,i}$ is the selected position of $C_i$. $C'_i$ (yellow ball) is placed at the position $\mathbf{x}_{C',i}$ alternative
to $\mathbf{x}_{C,i}$. Similarly, the $N_{i+1}$ atom has to be on the circle $\mathcal{C}_N$ and its position $\mathbf{x}_{N,i+1}$ is determined by $d_{N_{i+1},C_l}$ in a similar fashion.
doi:10.1371/journal.pcbi.1003539.g003

Placement of backbone atoms. From the $n$ sampled
dihedral angle pairs $(\phi, \psi)$, we can calculate the
coordinates of atoms $C_i$ and $N_{i+1}$ for all of the $n$ trials. $CA_{i+1}$
atoms are sampled by generating random $\omega$ dihedral angles
from a normal distribution with mean 180° and standard
deviation of 4°. Calculating the coordinates of backbone O
atoms using standard bond length and angle values is straightforward.

The coordinates of backbone atoms of the $n$ samples at this
particular growth step can be denoted as $(\mathbf{x}_{C,i}, \mathbf{x}_{O,i}, \mathbf{x}_{N,i+1}, \mathbf{x}_{CA,i+1})$. For
simplicity, we denote the coordinates of the four atoms at residue
$i$ as $S_i$ and the $k$-th sample as $S_i^k$. We sample one of them using an
energy criterion. The probability for $S_i^k$ is defined by

$$\pi(S_i^k \mid S_t, \ldots, S_{i-1}) \propto \exp(-E(S_i^k)/T),$$

where $T = 1$ is the effective temperature, and $E(S_i^k)$ is the
interaction energy of the four atoms defined by $S_i^k$ with the
remaining part of the protein, including those loop atoms sampled
in previous steps. The energy function $E$ is an atomic distance-dependent
empirical potential function constructed from the loop
database, which is effective in detecting steric clashes and efficient
to compute. Fragments with steric clashes are rarely drawn
because of their high energy values. In summary, the coordinates
of the four backbone atoms, $S_i = (C_i, O_i, N_{i+1}, CA_{i+1})$, are drawn
from the following joint distribution at this step:

$$S_i \sim \pi(d_{C_i,C_l} \mid d_{CA_i,C_l}) \cdot \pi(d_{N_{i+1},C_l} \mid d_{C_i,C_l}) \cdot \pi(\omega) \cdot \pi(\phi, \psi, a_i) \cdot \pi(S_i \mid S_t, S_{t+1}, \ldots, S_{i-1}).$$
Altogether, $(l - t)$ backbone dihedral angle combinations need to
be sampled. When the growing end is three residues away from the
C-terminal anchor atom of the loop, $C_l$, we apply the CSJD
analytical closure method to generate coordinates of the remaining
backbone atoms [12]. Small fluctuations of bond lengths, angles,
and $\omega$ dihedral angles are introduced to the analytical closure
method to increase the success rate of loop closure.

Improving computational efficiency 

To reduce the computational cost of calculating atom-atom
distances in energy evaluation, we use a procedure called
REsidue-residue Distance Cutoff and ELLipsoid criterion (Redcell).

Residue-residue distance cutoff. The residue-residue distance
cutoff $d_R$ is used to exclude residues far from the loop from the energy
calculation. Instead of a universal cutoff value, such as the 10 Å
C$\beta$-C$\beta$ distance used in reference [51], we use a residue-dependent
distance cutoff value. The residue-residue distance
cutoff $d_R$ is assigned to be $r_i + r_j + c$, where $r_i$ and $r_j$ are the
effective radii of residues $i$ and $j$, respectively. For one residue type,
the effective radius is the distance between the residue geometrical center
and the heavy atom that is farthest away from the residue
geometrical center. $c$ is a constant set to 8 Å. For a residue $i$ in the
loop region and residue $j$ in the non-loop region, we calculate the
residue-residue distance $d_{ij} = \|\mathbf{x}_i - \mathbf{x}_j\|$, where $\mathbf{x}_i$ and $\mathbf{x}_j$ are the
geometric centers of residues $i$ and $j$, respectively. If $d_{ij} > d_R$, all of
the atoms in residue $j$ are excluded from the energy calculation. This
residue-dependent cutoff is more accurate and ensures that close
residues are included.

Ellipsoid criterion. The basic idea of the ellipsoid criterion is to
construct a symmetric ellipsoid such that all atoms that need to be
considered for energy calculation during loop sampling are
enclosed in the ellipsoid. Atoms that are outside of the ellipsoid
can then be safely excluded. The starting and ending residues of a
loop naturally serve as the two focal points of the ellipsoid.
Intuitively, all backbone atoms of a loop must be within an
ellipsoid. Formally, we define a set of points $\{\mathbf{x}\}$, the sum of whose
distances to the two foci is less than $L$, defined as the sum of the
backbone bond lengths $b_{c-c}$ of the loop of length $l$:

$$\{\mathbf{x} = (x_1, x_2, x_3) \in \mathbb{R}^3 \mid \|\mathbf{x} - \mathbf{x}_1\| + \|\mathbf{x} - \mathbf{x}_2\| < L\}, \quad L = 2a, \tag{6}$$

where $\mathbf{x}_1$ and $\mathbf{x}_2$ are the two focal points of the ellipsoid. The
symmetric ellipsoid ($b = c$) can be written as:

$$\frac{x_1^2}{a^2} + \frac{x_2^2}{b^2} + \frac{x_3^2}{c^2} = 1, \tag{7}$$

where $a = L/2$ and $b = [(L/2)^2 - (\|\mathbf{x}_1 - \mathbf{x}_2\|/2)^2]^{1/2}$ correspond to
the semi-major axis and semi-minor axis of the symmetric
ellipsoid, respectively. To incorporate the effects of side chain
atoms, we enlarge the ellipsoid by the amount of the maximum
side-chain length $s$. Furthermore, we assume that any atom can
interact with a loop atom if it is within a distance cut-off of $k$. As a
result, the overall enlargement of the ellipsoid is $(s + k)$. The final
definition of the enlarged ellipsoid for detecting possible atom-atom
interactions is given by Eqn (7), with

$$a = (\|\mathbf{x}_1 - \mathbf{x}_2\|/2) \sec\alpha_2, \tag{8}$$

and

$$b = (\|\mathbf{x}_1 - \mathbf{x}_2\|/2) \tan\alpha_1 + s + k, \tag{9}$$

where $\alpha_1$ is determined by the equation $\sec\alpha_1 = L / \|\mathbf{x}_1 - \mathbf{x}_2\|$, and $\alpha_2$
by $\tan\alpha_2 = [(s + k) + (\|\mathbf{x}_1 - \mathbf{x}_2\|/2)\tan\alpha_1] / (\|\mathbf{x}_1 - \mathbf{x}_2\|/2)$ (see Figure 5B).

Figure 4. Mean of minimum backbone RMSD values for 140
protein loops. We generated 5,000 samples for each loop. The mean
value of the minimum RMSD of the 140 loops (y-axis) is plotted against
the size of trial samples $n$ (x-axis) for different choices of $m$. For control,
results obtained without sampling torsion angles ($m = n$, control) are
also plotted. The backbone (N, C$\alpha$, C and O atoms) RMSD in this paper is
calculated by fixing the rest of the protein body.
doi:10.1371/journal.pcbi.1003539.g004

For any atom in the protein, if the sum of its distances to the two
focal points is greater than $2a$, this atom is permanently excluded
from energy calculations. The computational cost to enforce this
criterion depends only on the loop length and is independent of
the size of the protein, once the rest of the residues have been
examined using the ellipsoid criterion. This improves our
computing efficiency significantly, especially for large
proteins. This criterion also helps to prune chain growth by
terminating a growth attempt if the placed atoms are outside the
ellipsoid.
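The ellipsoid test can be sketched in a few lines. The function names are hypothetical. The sketch uses the fact that, with half focal separation $c_f = \|\mathbf{x}_1 - \mathbf{x}_2\|/2$, the unenlarged semi-minor axis is $[(L/2)^2 - c_f^2]^{1/2}$, and that a semi-major axis of the form $c_f \sec\alpha$ with $\tan\alpha = b/c_f$ equals $(c_f^2 + b^2)^{1/2}$. An atom is excluded when the sum of its distances to the two foci exceeds $2a$.

```python
import math

def enlarged_semi_major(focus1, focus2, loop_length, s, k):
    """Semi-major axis a of the ellipsoid after enlarging the semi-minor axis by s + k."""
    c_f = math.dist(focus1, focus2) / 2.0
    half_length = loop_length / 2.0
    if half_length < c_f:
        raise ValueError("loop is too short to connect the two anchors")
    b_original = math.sqrt(half_length ** 2 - c_f ** 2)
    b_enlarged = b_original + s + k
    # a = c_f * sec(alpha) with tan(alpha) = b_enlarged / c_f
    return math.hypot(c_f, b_enlarged)

def outside_ellipsoid(atom, focus1, focus2, a):
    """True if the atom can be excluded from energy calculations."""
    return math.dist(atom, focus1) + math.dist(atom, focus2) > 2.0 * a
```

With $s = k = 0$ the function returns $L/2$, recovering the unenlarged ellipsoid.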

Side-chain modeling and steric clash removal 

Side chains are built upon completion of backbone sampling of
a loop. For the $i$-th residue of type $a_i$, we denote the degrees of
freedom (DOFs) for its side chain as $s(a_i)$. The DOFs of side chains
depend on the residue type, e.g. Arg has four dihedral
angles $(\chi_1, \chi_2, \chi_3, \chi_4)$, with $s(\mathrm{ARG}) = 4$. Val only has one dihedral
angle $(\chi_1)$, with $s(\mathrm{VAL}) = 1$. Each DOF is discretized into bins of
4°, and only bins with non-zero entries for all loop residues in the
loop database are retained.

We sample $n_{sc}$ trial states of side chains from the empirical
distribution $\pi(\chi_1, \ldots, \chi_{s(a_i)})$ obtained from the loop database. One of
the $n_{sc}$ trials is then chosen according to the probability calculated
by the empirical potential. Denoting the side chain fragment for
the $i$-th residue as $z_i$, we select $z_i$ following the probability
distribution:

$$\pi(z_i) \propto \exp(-E(z_i)/T),$$

where $E(z_i)$ is the interaction energy of the newly added side chain
fragment $z_i$ with the remaining part of the protein, and $T$ is the
effective temperature.

When there are steric clashes between side chains, we rotate the
side-chain atoms along the C$\alpha$-C$\beta$ axis for all residue types except
Pro. For Pro, we use the N — C^ axis for rotation. We consider two 
atoms to be in steric clash if the ratio of their distance to the sum of 
their van der Waals radii is less than 0.65 [13]. 
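A minimal sketch of the clash test and of a rotation about a bond axis is shown below. The function names are hypothetical, and the rotation uses Rodrigues' formula as a generic way to rotate points about an axis, which is an assumption about how such a rotation could be carried out rather than a description of the DiSGro code. Van der Waals radii must be supplied by the caller.

```python
import math

def in_steric_clash(pos_a, pos_b, radius_a, radius_b, ratio=0.65):
    """True if distance / (sum of van der Waals radii) is below the ratio threshold."""
    return math.dist(pos_a, pos_b) / (radius_a + radius_b) < ratio

def rotate_about_axis(point, axis_start, axis_end, angle):
    """Rotate point by angle (radians) about the line from axis_start to axis_end."""
    axis = [axis_end[i] - axis_start[i] for i in range(3)]
    length = math.sqrt(sum(c * c for c in axis))
    k = [c / length for c in axis]
    v = [point[i] - axis_start[i] for i in range(3)]
    k_cross_v = [k[1] * v[2] - k[2] * v[1],
                 k[2] * v[0] - k[0] * v[2],
                 k[0] * v[1] - k[1] * v[0]]
    k_dot_v = sum(k[i] * v[i] for i in range(3))
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    # Rodrigues' rotation formula applied to the vector relative to axis_start.
    rotated = [v[i] * cos_a + k_cross_v[i] * sin_a + k[i] * k_dot_v * (1.0 - cos_a)
               for i in range(3)]
    return [rotated[i] + axis_start[i] for i in range(3)]
```

A clash-removal loop would apply `rotate_about_axis` to every side-chain atom beyond the axis, using the same angle, and re-run `in_steric_clash` until no clashing pair remains or a retry limit is reached.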
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Figure 5. Schematic illustration of ellipsoid criterion. (A) Three 
dimensional view of a point $\mathbf{x}$ located on the ellipsoid constructed
from the total loop length $L$ and the two foci $\mathbf{x}_1$ and $\mathbf{x}_2$. (B) Two
dimensional view through the $x_1$-axis of the ellipsoid, with
$a = L/2$ and $b = c = [(L/2)^2 - (\|\mathbf{x}_1 - \mathbf{x}_2\|/2)^2]^{1/2}$ (dark gray); $c$ is along the $x_3$-axis,
not shown. The maximum side-chain length is denoted as $s$ and
the distance cut-off of interaction is $k$. The enlarged ellipsoid, which has
updated $a$ and $b$, is also shown (light gray).
doi:10.1371/journal.pcbi.1003539.g005 



Potential function 

To evaluate the energy of loops, we developed a simple atom-based
distance-dependent empirical potential function, following
well-established practices [46,52,60-66]. Empirical energy functions
developed from databases have been shown to be very
effective in protein structure prediction, decoy discrimination, and
protein-ligand interactions [54,63,64,67-71]. As our interest is in
modeling the loop regions, the atomic distance-dependent
empirical potential is built from loop structures collected in the
PDB [72].

Instead of using the detailed 167 atom types associated with the 20
amino acids, we group all heavy atoms into 20 groups, similar to
the approach used in Rosetta [50]. The 16 side-chain atom types
comprise six carbon types, six nitrogen types, three oxygen types,
and one sulfur type. The 4 backbone types are N, C$\alpha$, C, and O.
This simplified scheme helps to alleviate the problem of sparsity of
observed data for certain parameter values. For an atom $i$ in the
loop region of atom type $a_i$ and an atom $j$ of atom type $a_j$,
regardless of whether $j$ is in the loop region, the distance-dependent
interaction energy $E(a_i, a_j; d_{ij})$ is calculated as:



$$E(a_i, a_j; d_{ij}) = -\ln \frac{\pi(a_i, a_j; d_{ij})}{\pi'(a_i, a_j; d_{ij})}, \tag{10}$$

where $E(a_i, a_j; d_{ij})$ denotes the interaction energy between a
specific atom pair $(a_i, a_j)$ at distance $d_{ij}$, and $\pi(a_i, a_j; d_{ij})$ and
$\pi'(a_i, a_j; d_{ij})$ are the observed probability of this distance-dependent
interaction from the loop database and the expected
probability from a random model, respectively.

The observed probability $\pi(a_i, a_j; d_{ij})$ is calculated as:

$$\pi(a_i, a_j; d_{ij}) = \frac{n(a_i, a_j; d_{ij})}{n_{total}}, \tag{11}$$

where $n(a_i, a_j; d_{ij})$ is the observed count of $(a_i, a_j)$ pairs found in the
loop structures with the distance $d_{ij}$ falling in the predefined bins.
We use a total of 60 bins for $d_{ij}$, ranging from 2 Å to 8 Å, with the
bin width set to 0.1 Å. $d_{ij}$ ranging from 0 Å to 2 Å is treated as
one bin. Here $n(a_i, a_j; d_{ij}) = \sum_{k=1}^{N} n(a_i, a_j, d_{ij}(k))$, where $N$ is the
number of loops in our loop database, and $n(a_i, a_j, d_{ij}(k))$ is the
observed number of $(a_i, a_j)$ pairs at the distance of $d_{ij}$ in the $k$-th
loop. $n_{total}$ is the observed total number of all atom pairs in the
loop database regardless of the atom types and distance, namely,
$n_{total} = \sum_{d_{ij}} \sum_{a_i} \sum_{a_j} n(a_i, a_j; d_{ij})$.

The expected random distance-dependent probability of this
pair, $\pi'(a_i, a_j; d_{ij})$, is calculated based on sampled loop conformations,
called decoys. It is calculated as:

$$\pi'(a_i, a_j; d_{ij}) = \frac{n'(a_i, a_j; d_{ij})}{n'_{total}}, \tag{12}$$

where $n'(a_i, a_j; d_{ij}) = \sum_{k=1}^{N} \left( \frac{\sum_{x=1}^{M} n'(a_i, a_j, d_{ij}(x,k))}{M} \right)$ is the expected
number of $(a_i, a_j; d_{ij})$ pairs averaged over all decoy loop
conformations of all target loops in the loop database. Here
$n'(a_i, a_j, d_{ij}(x,k))$ is the number of $(a_i, a_j)$ pairs at distance $d_{ij}$ in the
$x$-th generated loop conformation for the $k$-th loop. $M$ is the
number of decoys generated for a loop, which is set to 500. $N$ is
the number of loops in our loop database. $n'_{total}$ is the total number
of all atom pairs in the reference state,
$n'_{total} = \sum_{d_{ij}} \sum_{a_i} \sum_{a_j} n'(a_i, a_j; d_{ij})$.
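The construction of the potential from counts can be sketched as follows. The data layout (dictionaries keyed by atom-type pair and distance bin), the function names, and the handling of empty bins (returning `None`) are assumptions made for illustration; the text does not say how bins with zero counts are treated. The binning follows the description above: one bin for 0 to 2 Å, then 60 bins of width 0.1 Å up to 8 Å.

```python
import math

def distance_bin(d):
    """Bin index for a distance in angstroms: 0 for [0, 2), 1..60 for [2, 8), None beyond."""
    if d < 2.0:
        return 0
    if d >= 8.0:
        return None
    return 1 + int((d - 2.0) / 0.1)

def pair_energy(observed_counts, expected_counts, key):
    """E = -ln(pi_observed / pi_expected) for key = (type_i, type_j, bin)."""
    n_total = sum(observed_counts.values())
    n_total_expected = sum(expected_counts.values())
    p_observed = observed_counts.get(key, 0) / n_total
    p_expected = expected_counts.get(key, 0) / n_total_expected
    if p_observed == 0.0 or p_expected == 0.0:
        return None  # sparse bin; how to handle it is left open in this sketch
    return -math.log(p_observed / p_expected)
```

A pair that is observed more often in native loops than in decoys receives a negative (favorable) energy, and a pair that is over-represented in decoys receives a positive one.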



Tool availability 

We have made the source code of DiSGro available for 
download. The URL is at: tanto.bioengr.uic.edu/DiSGRo/. 

Supporting Information 

Text S1 Results of modeled loops on Test Sets 2-5,
calculated using DiSGro. Tables 1-3 are tables for Test Set 2.
Tables 4-12 are tables for Test Set 3. Tables 13-18 are tables for
Test Set 4. Tables 19-22 are tables for Test Set 5.
(PDF) 
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